Course: Applied Geo-Data Science at the University of Bern (Institute of Geography)

Supervisor: Prof. Dr. Benjamin Stocker

Adviser: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider

Further information: https://geco-bern.github.io/agds/

Do you have questions about the workflow? Contact the Author:

Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)

Matriculation number: 20-100-178

Reference: Report Exercise 6 (Chapter 9)

1 Introduction

1.1 Objectives

This exercise is an introduction to supervised machine learning. With a realistic problem, the following goals should be trained:

  • Implementation of the KNN-algorithm

  • Discuss the bias-variance trade-off

  • The role of the hyper-parameter k

1.2 Theory

Supervised machine learning is a type of machine learning where the model is trained using labeled data to predict new and unseen data as accurately as possible (with the lowest error in the new data). But this approach has some challenges. For example, we need to know how good the quality of the predicted data is. Therefore, we use only a part of the labeled data for training. With the other part of the labeled data, we quantify the error. The KNN-algorithm (k nearest neighbors) will be trained to predict new data. The basis to calculate a model is distance between values of the variable. The two most important distance metrics for KNN are the euclidean distance and the Manhattan distance. With the hyper-parameter k we can fit our model. But what is the hyper-parameter k ?:

Euclidean distance:\[ \left(\sum_{j=1}^P (X_{a,j} - X_{b,j})^2\right)^{\frac{1}{2}} \]Manhattan distance:\[ \sum_{j=1}^P |X_{a,j} - X_{b,j}| \]

Consider the following: With k = 1, we use only one neighbor and we can easily hit every target value in the training data exactly. The bias in the training data will be zero. Unfortunately, it would also include all errors. In this exercise, the target (GPP) and the predictors come from measurements and have always errors (always statistical and almost always systematic errors (measurement noise)). If we apply the model to predict new data, then we would project the error into the new data. That will cause a higher variance in the predicted data because our model is overfitted. The bias-variance trade-off in the train data shifts to the variance.

If we use a k = n, we will include many neighbors to calculate the target. The model hits almost none of the target in the training data. The prediction will not be satisfied. The RSME (root-square-mean-error) is high in the training data and will be high in the test data. We call that an underfitted model, and the bias-variance trade-off shifts to a very high bias.

The optimal k (prediction with the lowest error) will be between this two extreme scenarios. But where is not so easy to answer. The goal is to find the model with the best generalisability, which is the model with the probably best bias-variance trade-off. This exercise shows you how we can find the best model with a KNN-algorithm.

2 Methods

We use the KNN-algorithm and compare the model to a multivariate linear regression model. We split our data and use 70% for training and 30% for model evaluation. Because the algorithm uses distance, we have to standardize our data. With the box-cox function, we transform our data into a normal distribution. We use set.seed for reproducibility.

2.1 R: Evaluation of the data

The open-source program R-Studio (Version 2022.12.0+353) was used for all studies. The data handling in this program is done using the R language. We utilize the R-package “Tidyverse” and other packages to process the data effectively.

3 Programming and data evaluation

3.1 Packages

The following code chunk contains all packages we need. Important is the package “conflicted”. It enables us to choose if different functions have the same call, but do not make the same thing (a function in a certain package can have the same name as a function from another package). In this case, we will set preferences.

source("../../R/general/packages.R")

3.2 Read the file

First, we get the data and save it in our repository. But we must do this only once. If the file exists, then we can read it directly. That is why we implemented an if-else statement.

name.of.file <- "../../data/re_ml_01/daily_fluxes.csv"

# If do not exists such a file, create it!
if (!file.exists(name.of.file)){
  
  # Access to the data
  url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data
  /FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv"
  
  # Read in the data directly from URL
  daily_fluxes <- read.table(url, header = TRUE, sep = ",")
  
  # Write a CSV file in the respiratory
  write_csv(daily_fluxes, "../../data/re_ml_01/daily_fluxes.csv")
  
  # Read the file
  daily_fluxes.raw <- read_csv("../../data/re_ml_01/daily_fluxes.csv")
  
  # If exists such a file, read it only!
  }else{daily_fluxes.raw <- read_csv("../../data/re_ml_01/daily_fluxes.csv")}

3.3 Data Overview

3.3.1 Missing values

If we take a closer look at the file, we see that there are 334 variables. We can not continue with our usual procedure because there are too many variables. That is why we have to clean our data first.

3.3.2 Data cleaning

We select only the columns we need. Further, we can see that some columns contain -9999 as a value. Our use.good.quality.only-function changed that to NA. Then we use ymd() from the lubridate package to rewrite the date in a proper way. Further, we want only columns which contain good quality. For that, we check selected columns with their quality control column. If the proportion of good measurement is less than 80%, then we overwrite the value with NA (and do not drop the row!). After that, we discard all quality-control columns. Now we can see that the variable LW_IN_F (long-wave radiation) contains about 56% NAs. This is very problematic because for model calculation we will drop all rows which contains NAs. We would lose about 56% of our data. We also can to this because we know from “re_stepwise” that the influence of long-wave on GPP is negligible. Further, the variable TIMESTAMP contains the date. But we are not sure if we can standardize time with box-cox. That is why we do not use these variables. All other variables have good quality.

Attention: unlike here, the variables PA_F and P_F are not used in the AGDS script. Therefore, the results will be significantly different, for example RMSE, RSQ or optimal k.

# Load the function in the file
source("../../R/re_ml_01/function.use.good.quality.only.R")

# Function call to clean the data
daily_fluxes <- use.good.quality.only(daily_fluxes.raw)

# Load the function
source("../../R/general/function.visualize.na.values.R")

# Function call
visualize.na.values.without.groups(daily_fluxes, c("Problem with long-wave radiation!"))

3.4 Implementation of the KNN algorithm

3.4.1 Split the data

Here we split our data into a training subset and a test subset (70% training and 30% test). After we split the data, we can calculate our model. We use all variables except TIMESTAMP and LW_IN_F. For KNN, we use k = 8 as a first try. This choice is random. The last code chunk in this workflow will show you which k is the optimal one (we do not want to spoil anything here). That is why you should use k = 8 first.

# For reproducibility (pseudo-random)
set.seed(123)  
# Split 70 % to 30 % 
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")

# Split the data in a training set
daily_fluxes_train <- rsample::training(split)
# And a test set
daily_fluxes_test <- rsample::testing(split)

# Load the function in the file.
source("../../R/re_ml_01/function.train.and.test.R")

# Function call for KNN with k = 8
mod.knn <- knn.model(daily_fluxes_train, 8)

# Function call for lm
mod.lm <- lm.model(daily_fluxes_train)

3.4.2 Evaluation

After we split our data, we have to make a model evaluation. The first function call is for the lm-model. The function needs the model, which we have calculated in the code chunk above. This is a multivariate regression model, and therefore, we cannot improve the formula.

The second function call is for the KNN-model. The KNN-model has been calculated in the code chunk above. The number for k was 8. If we want a model with a different k, we have to recalculate the knn-model with the function in the code chunk above.

lm-Model

The orange line has the slope of 45°. If the model predicted all values accurately, all values would lie on this line because the fitted value would be the same as the target. The RMSE would be zero. The bias is in the train data set per definition zero.

The red line is the calculated regression line. It shows that it has a lower angle than the orange line. It follows that our model systematically underestimates the values. We also see that in the negative bias. It also shows that especially high values are more underestimated than smaller ones. If the red line also have a 45° slope in the test data set, then the bias would be zero an we would have a perfect prediction (which is in general impossible).

KNN-Model

The situation for KNN is very similar. The regression line is a little steeper than that of the lm-model. This means that KNN underestimates the values less than the lm-model. We also see that in the lower bias than for the lm-model. But also KNN tends to underestimate high values more than small values.

# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")

# Function call for linear regression model
eval_model(mod = mod.lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (lm)"), c("Davos (lm)"))

# Function call for KNN
eval_model(mod = mod.knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (knn: k=8))"), c("Davos (knn: k=8)"))

3.4.3 Visualization

Here we can see how the KNN-model depends on k. If we use k = 1, then we have perfect training data (the regression line has a 45° slope and all values lie on it. RMSE and bias are zero). But the RMSE in the test data is high (the model is overfitted). If we use k = n our model would be underfitted, and therefore the RMSE in the test data would also be high. For a visualization of the development of RSQ and RSME as a function of k, please take a look at the last visualization (discussion part). For almost all k the bias is negative which means our models underestimate the target.

# Load the function into the file
source("../../R/re_ml_01/function.different.k.R")

# Define a sequence for k
my.sequence <- c(1, 8, 15, 50, 100)
# Function call
different.k(my.sequence,daily_fluxes_train,daily_fluxes_test,c("Davos"),c("Davos"))

## k-Nearest Neighbors 
## 
## 1910 samples
##    8 predictor
## 
## Recipe steps: BoxCox, center, scale 
## Resampling: None

4 Discussion

In this short discussion part, we want to show you two main aspects of this exercise and discuss each aspect shortly. First, we need to know how to judge the models. For this, we need to know why one model can be considered better than another. How can we estimate the bias-variance trade-off? Second, we need to know how the KNN-algorithm works. For this, we need knowledge about the role of the hyper-parameter k.

Bias1:

Erroneous assumptions in the learning algorithm lead to a bias. High bias can cause an algorithm to miss the relevant relations between features and target outputs.

Variance1:

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs.

bias-variance trade-off1:

Both bias and variance are connected to the model’s complexity. Low complexity means high bias and low variance. Increased complexity means low bias and high variance. Ideally, we want a model with a low bias and a low variance. But this is often impossible and therefore we need the best trade-off. We describe our model with the terms under- and overfitting. Normally, underfitting implies high bias and low variance, and overfitting implies low bias but high variance.

4.1 Model discussion and the role of the bias-variance trade-off

4.1.0.1 Why is the difference between the evaluation on the training and the test set larger for the KNN-model than for the linear regression model?

lm-model:

A lm-model uses all predictors to calculate a global multivariate regression model (see re_stepwise). However, it is a linear regression. If we use the equation (see theory), then we made some assumption (linearity, normality, homoscedasticity and independence).

Obviously, a lm-model does not looks for any underlying pattern. It creates simply a linear combination with all available variables. Further, we can not optimize lm-models like KNN-models. The lm-model describes the train data with all the metrics (bias, variance, or RMSE etc.). It is therefore logical that all these metrics are projected into the new data. These metrics are then relatively similar in the test data as in the train data.

KNN-model (k = 8):

The situation for KNN is different. KNN uses a “distance”. KNN does not necessarily use all predictors. It uses k neighbors to calculate a local regression. The KNN-model looks for an underlying pattern. If we optimize the KNN-model due to the hyperparameter k, then we change the metrics in the train data. We can overfit our model and get a low bias but a high variance, or we can underfit our model and get a higher bias but a lower variance.

So we can play around with these metrics, but you see that there is a trade-off between the bias and the variance. We can neither predict well with an overfitted model nor with an underfitted model. Due to this trade-off, the metrics change in the test sets for the KNN-models more than for lm-models.

4.1.0.2 Why does the evaluation on the test set indicate a better model performance of the KNN model than the linear regression model?

We see that KNN-model has higher RSQ than lm (0.66 vs. 0.61) and lower RSME (1.51 vs. 1.63) which means KNN is the better model than lm. But be careful: RMSE and RSQ in the KNN-model depends directly on k. We must optimize k first!

4.1.0.3 How would you position the KNN and the linear regression model along the spectrum of the bias-variance trade-off?

The KNN-model (k = 8) has a bias of -0.06 and lm-model has bias of -0.1 which means the lm-model has a lower model complexity than KNN-model. It follows that lm-model is more on the bias side than KNN-model. We also see that in RSQ. It is in KNN a little bit higher than in lm and therefore a bit less unexplained variance.

But now we know, that optimal k is 15 (spoiler alert!). Therefore, KNN with k = 8 has not the best trade-off and it is also on the bias side (but lesser than lm-model). The other KNN-models (k =50 and k = 100) are on the variance side in the trade-off.

4.1.0.4 Visualize temporal variations of observed and modeled GPP for both models, covering all available dates.

lm-model:

First, there is a annual cycle. We also see that the prediction varies in quality. There are years with a lower residue than other years. The model underestimate the target for a certain season and overestimate for the opposite season. We want to know more. We plot the residue as a function of the day of the year. Now we see, that our lm-model underestimate the target in winter and overestimate it in summer. Because the target is GPP (Gross Primary Productivity) it makes totally sense.

KNN-model:

We make the same thing for the KNN-model. The results is very similar to the lm-model. The model underestimate also the winter target and overestimate the values during summer.

# Load the function into the file
source("../../R/re_ml_02/function.vis.residue.R")

# Function call for KNN with k = 8
mod.knn.optimal <- knn.model(daily_fluxes_train, 15)

# Residue for Davos with lm (year)
time.variation.year(mod.lm, daily_fluxes_test,
                    c("multiple linear regression (lm)"), 
                    c("re_ml_1 (Chapter 9)"))

# Residue for Davos with lm (day of the year)
time.variation.year(mod.lm, daily_fluxes_test,   
                    c("multiple linear regression (lm)"), 
                    c("re_ml_1 (Chapter 9)"), annual = FALSE)

# Residue for Davos with KNNn (year)
time.variation.year(mod.knn.optimal, daily_fluxes_test,
                    c("KNN with: Train = Davos, Test = Davos, k = 15"), 
                    c("re_ml_1 (Chapter 9)"))

# We want a better resolution of the residue of Davos (daily). We use plot = false to return a tibble. After that, we can pipe it
time.variation.year(mod.knn.optimal, daily_fluxes_test,   
                    c("KNN with: Train = Davos, Test = Davos, k = 15"), 
                    c("re_ml_1 (Chapter 9)"), annual = FALSE)

4.2 Discus the role of k and generalisability of a model

The k in KNN is a hyper-parameter that refers to the number of nearest neighbors. To quantify the “errors”, we use MAE instead of RMSE. because it is more related to the variance. Although the values are different, their flows are synchronous. That means that both MAE and RMSE are hyperbolas and have their global minima at the same k.

Hypothesis:

“The lower we chose k, the higher the RSQ in the training data, but the higher the variance in the test data.”

Explanation: Let k = 1. Then the training data would be perfectly predicted because only one
(the nearest) neighbor would be taken into account. We would hit every target value. The bias would be zero, and therefore RSQ = 1, and the MAE is 0 as well. But the model would be totally overfitted. That means there is a bias-variance trade-off. If the bias in the training data is low, then the variance in the test data is, in general, high.

Let k = n. Then it would follow: \(RSQ\in(0,1]\) and \(MAE\in[0, \infty)\). But we do not know where it is. However, the model tends to be underfitted because it takes all neighbors into account. Therefore, the prediction would be the mean value of the observations (worst case).

Conclusion: We want a model with zero bias and zero variance. But this seems impossible, so we have to find the model with the best bias-variance-trade-off.

4.2.0.1 Showing model generalisability as a function of model complexity and optimal k

The hypothesis is not fully true. RSQ is a quantity to describe the variance. Or better, it describes how much of the variance explains the model. For the best bias-variance trade-off, we want k where the error in the test data is lowest (lowest MAE).

The model for k = 1 is totally overfitted. In the training data, RSQ is 1 and it follows that MAE (or RSME) must be 0. But in the test data, MAE is at a global maximum. The model is useless for any predictions.

If we slowly increase k, then increase MAE and decrease RSQ on the train set. In the test data RSQ increase and MAE decrease. If we exceed optimal k, then RSQ decrease and MAE increase in the test set. We see, there mus be a optimal k.

For k = 100 the RSQ is lowest in the training data. The MAE increased for both data sets but is not at a maximum. The model is not suitable for predictions because the trade-off is on the side of too much bias.

Long story short: If k is to the left of optimal k, then the model is overfitted. If k is to the right of optimal k, then it is underfitted.

Generalisability means that we want the model with the lowest MAE in the test data. For that, we mark the minimum of the RMSE test data hyperbola (green point = 15). Important: Here we set a seed and split our data only pseudo-randomly. If we do not use set.seed(123), then it would be another optimal k.

# Load the function into the file
source("../../R/re_ml_01/function.parameter.extracter.R")

# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))

# Visualize the MAE and RSQ --> optimal k is 15
parameter.extracter(my.sequence, daily_fluxes_train, daily_fluxes_test, 
                    c("re_ml_01 (Chapter 9"))